Executive Summary

This project applies predictive modeling techniques to analyze cardiovascular health, focusing on two key objectives: predicting maximum heart rate achieved (thalach) and classifying the presence of heart disease. The analysis begins with an exploratory data assessment to understand variable distributions and relationships before implementing a range of regression and classification models.

For thalach prediction, multiple regression approaches were explored, including linear regression, Lasso regression, and regression trees. Among these, the tuned regression tree model emerged as the most effective, achieving the lowest RMSE (14.56) and MAE (11.28) and the highest R² (0.56), significantly outperforming linear models, which explained only 39% of the variance. The Variable Importance Plot (VIP) revealed that ST segment (slope), age, and ST depression (oldpeak) were the most influential predictors of thalach, reinforcing their significance in cardiovascular assessments.

For heart disease classification, both logistic regression and classification trees were evaluated. The classification trees outperformed logistic regression, achieving 93% accuracy, 91% sensitivity, and 95% specificity and precision. The ROC Curve confirmed this distinction, with the classification tree achieving an AUC of 0.98. The Variable Importance Plot (VIP) identified chest pain type (cp), thalassemia status (thal), and maximum heart rate (thalach) as the most critical factors in predicting heart disease, aligning with established medical risk indicators.

A key observation was that adjusting the classification threshold to 0.59 had minimal impact, as most predicted probabilities remained below this value, leading to unchanged classification results. This highlighted the importance of careful threshold selection and model tuning to optimize classification performance.

In conclusion, this project demonstrates the effectiveness of regression tree models for predicting maximum heart rate and classification trees for identifying heart disease. The findings emphasize the importance of feature selection, model complexity, and cutoff optimization in improving predictive accuracy and clinical relevance.

The Project
The Problem Description

According to the CDC, one person dies every 33 seconds from cardiovascular disease. Heart disease poses a significant health risk, making it crucial to identify its risk factors to implement proactive measures before it becomes critical. This report seeks to identify the key risk factors contributing to heart disease through an analysis of a dataset. This data collection dates from 1988 and consists of four databases (Cleveland, Hungary, Switzerland, and Long Beach V), each representing individual patient records. This dataset is designed to help predict the presence of heart disease based on various medical, demographic, and diagnostic information about the patients.

For this analysis, I will first examine the distribution of the variables and look for relationships. Second, I will conduct regression analysis to predict maximum heart rate achieved (thalach), as it is a key indicator of cardiovascular fitness and heart function. This will include a variety of methods, such as linear regression, lasso regression, and regression trees, to determine the most significant predictors and identify the most effective model for understanding how different factors influence thalach. Third, I will perform logistic regression using backward elimination and classification trees to predict the presence of heart disease. This is a classification task, where the dependent variable (target) indicates whether heart disease is present or absent based on multiple attributes. Finally, I will end by summarizing the conclusions and reflections.

The Data

This dataset has 1025 rows and 14 variables.

Data Sources

The data set used is from Kaggle (2019), titled Heart Disease Dataset, provided by David Lapp.

Variables
TO PREDICT WITH
  • age: The age of the patient (in years) (Continuous)
  • sex: Gender of the patient: 0 = Female, 1 = Male (Categorical)
  • cp: Type of chest pain experienced, coded 0-3 in this dataset: 0 = Typical angina (linked to heart issues), 1 = Atypical angina, 2 = Non-anginal pain, 3 = Asymptomatic (Categorical)
  • trestbps: Resting blood pressure in mm Hg (Continuous)
  • chol: Serum cholestoral in mg/dl (Continuous)
  • fbs: Fasting blood sugar > 120 mg/dl: true (1) or false (0). Binary indicator (Categorical)
  • restecg: Resting electrocardiogram results: 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy (heart muscle issues) (Categorical)
  • thalach: The maximum heart rate achieved by the patient during exercise (Continuous)
  • exang: Binary indicator for exercise-induced angina: 1 = Yes, 0 = No (Categorical)
  • oldpeak: ST depression induced by exercise relative to rest (Continuous)
  • slope: Slope of the peak exercise ST segment, coded 0-2 in this dataset: 0 = Upsloping (normal), 1 = Flat (abnormality), 2 = Downsloping (risk) (Categorical)
  • ca: Number of major vessels (0-4 in this dataset) visible in fluoroscopy (Categorical)
  • thal: Thalassemia status, coded 0-3 in this dataset: 1 = Normal, 2 = Fixed defect, 3 = Reversible defect (the rare level 0 likely encodes missing values) (Categorical)
  • target: Presence of heart disease: 0= No, 1 = Yes. (Categorical)
WE WANT TO PREDICT
  • thalach: The maximum heart rate achieved by the patient during exercise
  • target: Presence of heart disease: 0= No, 1 = Yes
Data Overview

This dataset contains a variety of medical, demographic, and diagnostic attributes that help in predicting heart disease. I converted 8 categorical variables (sex, cp, fbs, restecg, exang, slope, ca, and thal) to factors.

Within the dataset, we can see that the age of patients ranges from 29 to 77 years, with a mean of approximately 54 years, reflecting a largely middle-aged population. Additionally, there is a notable gender disparity, with 713 male patients compared to 312 female patients. Several critical cardiovascular indicators exhibit wide variability. For instance, ST depression (oldpeak), which measures heart stress during exercise, ranges from 0 to 6.2, with higher values indicating more severe heart conditions. The maximum heart rate achieved (thalach) spans from 71 to 202 beats per minute, with a median of 152 bpm.

Looking at the overall distribution of heart disease cases, 526 out of 1025 patients (51.3%) have a heart condition, suggesting that more than half of the individuals in the dataset are affected. This highlights the importance of analyzing key risk factors that contribute to heart disease.

View the Data Summaries

Below is a summary of the values for each variable.

      age        sex     cp         trestbps          chol     fbs     restecg
 Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497  
 1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513  
 Median :56.00           2:284   Median :130.0   Median :240           2: 15  
 Mean   :54.43           3: 77   Mean   :131.6   Mean   :246                  
 3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275                  
 Max.   :77.00                   Max.   :200.0   Max.   :564                  
    thalach      exang      oldpeak      slope   ca      thal    target 
 Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7   0:499  
 1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64   1:526  
 Median :152.0           Median :0.800   2:469   2:134   2:544          
 Mean   :149.1           Mean   :1.072           3: 69   3:410          
 3rd Qu.:166.0           3rd Qu.:1.800           4: 18                  
 Max.   :202.0           Max.   :6.200                                  
Check Null values
     age      sex       cp trestbps     chol      fbs  restecg  thalach 
       0        0        0        0        0        0        0        0 
   exang  oldpeak    slope       ca     thal   target 
       0        0        0        0        0        0 

There are no missing values.

Heart Disease Barplot

The proportion of patients with heart disease is slightly higher than those without heart disease (51.3% vs 48.7%).

Response Variables Relationships with Predictors
  • Relation between Age and Heart Disease: Heart disease (target = 1) appears to affect individuals across a broader age range, with a tendency toward younger ages compared to those without heart disease. On the other hand, the absence of heart disease (target = 0) is more common among older individuals in this dataset.

  • Relation between Gender and Heart Disease: Heart disease is observed in both genders but is more frequent among males. While females have a smaller population in the dataset, the proportion with heart disease seems relatively high compared to males. This suggests that heart disease affects both genders significantly but may impact males more in this dataset.

  • Relation between Sex, Age and Heart Disease: Among patients without heart disease (target = 0), there is a relatively even spread of males and females, though males appear slightly more frequent. However, for patients with heart disease (target = 1), males (blue) are much more prevalent than females (red), suggesting that men in this dataset are at a higher risk of developing heart disease. Additionally, age does not show a clear separation between those with and without heart disease, as patients with heart disease span a broad age range, indicating that age alone may not be the strongest predictor.

  • Relation between Thalach, Age and Heart Disease: Higher maximum heart rates are associated more frequently with the presence of heart disease (target = 1), and younger individuals may achieve higher heart rates compared to older individuals.

Relation between Age and Heart Disease
Relation between Sex, Age and Heart Disease
Relation between Gender and Heart Disease
Relation between Thalach, Age and Heart Disease
Correlation Matrix

The correlation matrix indicates that chest pain type (cp) and maximum heart rate achieved (thalach) have the strongest positive correlations with target, making them critical predictors of heart disease. Conversely, oldpeak (ST depression), number of major vessels (ca), and exercise-induced angina (exang) show strong negative correlations with target, highlighting their importance in predicting heart disease absence. There are no signs of multicollinearity among the independent variables, as no pairwise correlation between predictors exceeds ±0.6 in absolute value, supporting stable regression models.
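The multicollinearity screen described above can be reproduced with a simple pairwise correlation check. The sketch below (a Python analogue; the report's own analysis was done in R) uses synthetic stand-in columns rather than the actual patient data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for four continuous predictors (e.g. age, trestbps, chol, oldpeak)
X = rng.normal(size=(200, 4))

corr = np.corrcoef(X, rowvar=False)  # 4x4 pairwise correlation matrix

# Flag any predictor pair whose correlation exceeds the +/-0.6 threshold
flagged = [(i, j) for i in range(4) for j in range(i + 1, 4)
           if abs(corr[i, j]) > 0.6]
print(flagged)  # independent noise columns -> no multicollinearity flags: []
```

The same scan over the 14 heart-dataset variables is what justifies the "no multicollinearity" claim.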

The histogram displays the distribution of maximum heart rate achieved (thalach) across the dataset. The distribution appears slightly left-skewed (the mean of 149.1 bpm sits below the median of 152 bpm), with the majority of values concentrated between 100 and 180 bpm and a peak around 150 bpm, indicating that most individuals in the dataset reach a heart rate in this range. There are relatively fewer cases at the lower and higher extremes. The presence of multiple peaks suggests some variability in the data, potentially due to differences in age, fitness level, or health conditions among the patients.


    Welch Two Sample t-test

data:  thalach by target
t = -14.862, df = 976.86, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -22.02427 -16.88631
sample estimates:
mean in group 1 mean in group 2 
       139.1303        158.5856 

The extremely small p-value indicates that the roughly 19 bpm difference in mean thalach between the two target groups is highly statistically significant. This means the observed difference is unlikely to be due to random chance and likely represents a real effect in the population.
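The Welch comparison above can be reproduced with scipy's `ttest_ind` using `equal_var=False`. The arrays below are synthetic stand-ins whose group means are set near the 139 and 158.6 bpm reported above, not the actual patient data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for thalach in the two target groups; group sizes and
# means roughly mirror the report (patients with disease reach higher thalach)
no_disease = rng.normal(loc=139.0, scale=22.0, size=500)
disease = rng.normal(loc=158.6, scale=19.0, size=525)

# equal_var=False requests the Welch two-sample t-test used in the report
res = stats.ttest_ind(no_disease, disease, equal_var=False)
print(res.statistic < 0, res.pvalue < 0.001)  # True True
```

With samples this large and a ~19 bpm gap, the p-value is vanishingly small, matching the report's `p-value < 2.2e-16`.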

Regression Summary

For the prediction of the continuous variable thalach we will use linear regression. The results are summarized in this section.

After examining the final model, we can observe some issues in the residual plots that indicate potential concerns with our data. The homogeneity of variance plot shows a curved pattern, suggesting that the residuals are not evenly distributed across the fitted values, meaning that the model may not be capturing the variability in thalach equally for all predictions. The normality of residuals plot indicates some deviation from normality, particularly in the tails, which may affect the validity of inferential conclusions.

Additionally, comparing the full and pruned models, we see that removing predictors with high p-values did not significantly improve the fit, as the R² remained relatively low (around 0.39), and the RMSE and MAE values remained nearly the same. This suggests that neither the full nor the pruned linear model is the best fit for predicting thalach, and alternative approaches such as tree-based methods may improve predictive accuracy.

Effect on Thalach by the Predictor Variables
Variable Direction
age Decrease
cp Increase
trestbps Increase
chol Increase
exang Decrease
slope Increase
target Increase
Analysis Summary

We can see that our initial linear model achieves an R-squared of 39%, indicating that it explains a small to moderate portion of the variability in the response variable. Analyzing the residual plots suggests that the residuals are mostly normal, though there is a slight curvature in the residual vs. fitted values plot and some skew in the residual distribution.

After examining the model coefficients, we determine that some predictors do not significantly contribute to predicting thalach (p-values > 0.05). Consequently, we create a pruned model by removing these less significant predictors. This pruning step aims to simplify the model, potentially improving interpretability and generalization.
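The RMSE, MAE, and R² values reported in the model comparison tables follow their standard definitions. A minimal numpy sketch (the report's own computations were done in R), with a tiny made-up example:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Return (RMSE, MAE, R-squared) for actual values y vs. predictions y_hat."""
    resid = y - y_hat
    rmse = float(np.sqrt(np.mean(resid ** 2)))                     # root mean squared error
    mae = float(np.mean(np.abs(resid)))                            # mean absolute error
    rsq = float(1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2))  # R-squared
    return rmse, mae, rsq

# Tiny worked example with made-up thalach values
y = np.array([150.0, 160.0, 170.0, 140.0])
y_hat = np.array([148.0, 162.0, 165.0, 145.0])
print(regression_metrics(y, y_hat))  # ~ (3.81, 3.5, 0.884)
```

An R² of 0.39, as in the linear models below, means residual variation is still about 61% of the total variation in thalach.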

model RMSE MAE RSQ
Linear Model Full 17.89 13.95 0.39
The Full Regression Model Coefficients
term estimate std.error statistic p.value
(Intercept) 140.22 9.05 15.50 0.00
age -0.84 0.07 -12.05 0.00
sex 0.75 1.36 0.55 0.58
cp 2.35 0.64 3.66 0.00
trestbps 0.12 0.03 3.59 0.00
chol 0.03 0.01 2.92 0.00
fbs 2.30 1.65 1.40 0.16
restecg -1.24 1.10 -1.13 0.26
exang -9.32 1.41 -6.62 0.00
oldpeak -0.73 0.63 -1.15 0.25
slope 8.07 1.15 7.04 0.00
ca -0.54 0.62 -0.87 0.39
thal 1.96 0.98 1.99 0.05
target 7.60 1.60 4.76 0.00
Analysis Summary

The analysis of the pruned model reveals potential concerns with heteroscedasticity and non-linearity, which may impact the model's reliability. The homogeneity of variance plot (residuals vs. fitted values) shows a curved trend line rather than a flat horizontal one, suggesting that the variance of the residuals is not constant across predicted values. This heteroscedasticity implies that the model may perform better for some ranges of fitted values than others, leading to unreliable predictions, and it suggests that the relationship between the predictors and the dependent variable cannot be fully captured by a simple linear model. On the other hand, the normality of residuals plot indicates that residuals are approximately normal with some skew. As the pruned model removes unnecessary predictors, it might be preferable for simplicity and interpretability.

model RMSE MAE RSQ
Linear Model Full 17.894 13.951 0.394
Linear Pruned Final Model 17.980 14.036 0.389
The Final Regression Model Coefficients
term estimate std.error statistic p.value
(Intercept) 145.18 7.03 20.66 0
age -0.85 0.07 -12.59 0
cp 2.50 0.64 3.93 0
trestbps 0.13 0.03 3.80 0
chol 0.04 0.01 3.16 0
exang -9.06 1.39 -6.51 0
slope 8.67 0.99 8.75 0
target 7.23 1.40 5.16 0
Compare actual (thalach) vs predicted (y_hat) for pruned regression model
Lasso Regression Summary

After evaluating the data using linear regression, there were potential issues with non-linearity and heteroscedasticity, which affected the model's predictive performance. To better capture the relationships between predictors and the target variable, I implemented a lasso regression model with a penalty of 0.1 and a tuned lasso regression model evaluated on a train/test split.

The analysis of the Lasso regression model highlights the impact of regularization strength on model performance. Excessive penalization (lambda values of 10 or 250) led to over-regularization. By adjusting the penalty to 0.1, the model retained important predictors while still mitigating overfitting. The residual vs. predicted plot indicates that while the model performs reasonably well, there may be some heteroscedasticity, as residuals are not uniformly distributed across predicted values. This suggests that certain ranges of the target variable may be predicted with greater accuracy than others, potentially impacting reliability.

The final tuned lasso model, fit with a training/test split, yields similar metrics, but because the model with λ = 0.1 achieves a higher R² while maintaining comparable RMSE and MAE, it is the better choice. This model balances regularization and predictive power, making it preferable over the more constrained λ = 1 model, which shrinks coefficients more aggressively and slightly reduces prediction accuracy.

Effect on Thalach by the Predictor Variables by Lasso Regression Penalty 0.1
Variable Direction
age Decrease
sex Increase
cp Increase
trestbps Increase
chol Increase
fbs Increase
restecg Decrease
exang Decrease
oldpeak Decrease
slope Increase
ca Decrease
thal Increase
target Increase
Analysis Summary

We can see that our Lasso regression model with a penalty of 0.1 achieves an R-squared of 39%, indicating that it explains a small to moderate portion of the variability in the response variable. Analyzing the residual plots suggests that while the residuals are mostly normal, there is some heteroscedasticity present, as the residuals are not uniformly distributed across predicted values.

Examining the model coefficients reveals that some predictors contribute more significantly than others, with certain variables having been shrunk close to zero due to the Lasso penalty. Initially, using a penalty of 10 or 250 caused the model to be overly restrictive, shrinking all coefficients excessively and resulting in nearly identical predictions. This indicated excessive regularization, leading to a loss of meaningful relationships between predictors and the target variable.

To address this issue, a lower penalty of 0.1 was selected, allowing the model to retain important predictors while still applying regularization to mitigate overfitting. This adjustment balances interpretability and predictive performance.
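The over-shrinkage behavior described above is easy to demonstrate. The sketch below uses scikit-learn's Lasso (a Python analogue of the report's R workflow) on synthetic data where only the first two of five predictors truly matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                    # five standardized predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

for alpha in (10, 0.1):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))
# alpha=10 shrinks every coefficient to zero (over-regularization),
# while alpha=0.1 keeps the two real predictors near their true values
```

This mirrors the report's finding: penalties of 10 or 250 erased all signal, while 0.1 preserved the meaningful predictors.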

model RMSE MAE RSQ
Linear Model Full 17.894 13.951 0.394
Linear Pruned Final Model 17.980 14.036 0.389
Lasso Regression Penalty 0.1 17.897 13.948 0.394
Table of Coefficients

The Lasso regression model with a penalty of 0.1 retains key predictors while applying regularization to prevent overfitting. Among the most influential variables, age (-7.47) and exang (-4.32) show strong negative relationships, indicating that an increase in age and exercise-induced angina is associated with a lower predicted outcome. In contrast, cp (2.41) and thal (1.07) have positive coefficients, suggesting that chest pain type and thalassemia classification contribute to higher predictions.

term estimate penalty
(Intercept) 149.11 0.1
age -7.47 0.1
sex 0.20 0.1
cp 2.41 0.1
trestbps 2.05 0.1
chol 1.62 0.1
fbs 0.72 0.1
restecg -0.57 0.1
exang -4.32 0.1
oldpeak -0.81 0.1
slope 4.95 0.1
ca -0.46 0.1
thal 1.07 0.1
target 3.70 0.1
Compare actual (thalach) vs predicted
Analysis Summary

During the tuning process for the Lasso regression model, I initially encountered an issue where all predictions were constant, leading to errors in computing correlation-based metrics. This occurred because the penalty (lambda) was too high, causing excessive shrinkage of the coefficients and reducing model variability. By adjusting the penalty range from 0.00001 to 0.1, I allowed the model to retain more important predictors while still applying regularization to prevent overfitting. To ensure a robust evaluation, I implemented a train-test split (80% and 20%). This approach prevented data leakage and provided a realistic assessment of the model’s performance on unseen data. I selected the penalty value corresponding to the lowest MAE to optimize the model’s predictive performance. The Mean Absolute Error (MAE) represents the average absolute difference between predicted and actual values, making it a crucial metric for assessing model accuracy. I ensured that the final model was tuned to minimize prediction errors while maintaining regularization.
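The tuning procedure described above (80/20 split, a grid of penalties, select the one with the lowest MAE) can be sketched as follows. The data and penalty grid here are illustrative stand-ins, not the report's actual run:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 6))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=400)

# 80/20 train/test split, mirroring the report's evaluation scheme
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

penalties = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
mae = {a: float(np.mean(np.abs(y_te - Lasso(alpha=a).fit(X_tr, y_tr).predict(X_te))))
       for a in penalties}
best = min(mae, key=mae.get)   # penalty with the lowest test-set MAE
print(best, round(mae[best], 3))
```

Evaluating each candidate penalty only on the held-out 20% is what prevents the data leakage mentioned above.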

# A tibble: 1 × 2
  penalty .config              
    <dbl> <chr>                
1    1.00 Preprocessor1_Model01
model RMSE MAE RSQ
Linear Model Full 17.894 13.951 0.394
Linear Pruned Final Model 17.980 14.036 0.389
Lasso Regression Penalty 0.1 17.897 13.948 0.394
Lasso Regression Tuned 17.518 13.732 0.353
View the Coefficient table and Variable Importance

The tuned lasso regression model applied regularization, reducing some coefficients to zero, indicating they were not strong predictors. Key influential variables include age (-6.299) and exang (-3.717), which are associated with lower predicted thalach, while slope (4.537) and target (3.196) have strong positive effects. We can see this in the Variable Importance Plot.

term estimate penalty
(Intercept) 149.114 1
age -6.299 1
sex 0.000 1
cp 2.062 1
trestbps 0.947 1
chol 0.593 1
fbs 0.000 1
restecg 0.000 1
exang -3.717 1
oldpeak -0.424 1
slope 4.537 1
ca 0.000 1
thal 0.000 1
target 3.196 1
Compare actual (thalach) vs predicted
Regression Tree Summary

After evaluating the data using linear and lasso regression, there were potential issues with heteroscedasticity, which affected the models' predictive performance. To better capture complex relationships between predictors and the target variable, I implemented a regression tree model and a tuned regression tree model, which can automatically detect interactions and split the data into meaningful segments, improving prediction accuracy.

The first regression tree already improved on the linear models, achieving an RMSE of 15.50 and an R² of 0.55, compared with the linear and lasso regression models' RMSE of roughly 17.9 and R² of 0.39. This suggests that the tree model captures more of the variation in the data than linear regression.

However, to further enhance performance, I tuned the regression tree with training/test split (80% and 20%) by optimizing the cost complexity parameter and tree depth using cross-validation. The tuned regression tree model significantly outperformed the previous models, with an RMSE of 14.56 an MAE of 11.28, and an R² of 0.56. The actual vs. predicted scatter plots show a stronger correlation in the tuned tree, meaning it generalizes better. By refining the model through tuning, we achieved higher predictive power while maintaining interpretability.

Analysis Summary

I will predict thalach with all the variables. The regression tree model shows better performance than the linear and lasso models.

model RMSE MAE RSQ
Linear Model Full 17.894 13.951 0.394
Linear Pruned Final Model 17.980 14.036 0.389
Lasso Regression Penalty 0.1 17.897 13.948 0.394
Lasso Regression Tuned 17.518 13.732 0.353
Regression Tree Model 15.504 12.140 0.545
View the Regression Tree and Variable Importance

The regression tree has 15 leaf nodes. The variable importance plot shows that the top three features are slope (most influential), age, and oldpeak.

Compare actual (thalach) vs predicted (y_hat)
Analysis Summary

To see if tuning improves performance, I will use cross-validation on the cost complexity and the tree depth. After tuning (39 leaf nodes, adjusted complexity parameters), the regression tree achieved the lowest RMSE and highest R², demonstrating that the model now explains 56% of the variance in the target variable.

model RMSE MAE RSQ
Linear Model Full 17.894 13.951 0.394
Linear Pruned Final Model 17.980 14.036 0.389
Lasso Regression Penalty 0.1 17.897 13.948 0.394
Lasso Regression Tuned 17.518 13.732 0.353
Regression Tree Model 15.504 12.140 0.545
Tuned Regression Tree Model 14.557 11.279 0.557
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 6

Computational engine: rpart 

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
    cp = 1e-10, maxdepth = 6L)
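The rpart specification above (cp = 1e-10, maxdepth = 6, selected by cross-validation) corresponds roughly to the following scikit-learn sketch; the data, grid values, and parameter names here are illustrative stand-ins, not the report's actual tuning run:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
# Toy nonlinear data standing in for the thalach regression problem
X = rng.uniform(-2, 2, size=(500, 3))
y = np.where(X[:, 0] > 0, 10.0, -10.0) + 3 * X[:, 1] + rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validate over cost complexity (ccp_alpha, sklearn's analogue of
# rpart's cp) and maximum tree depth
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"ccp_alpha": [1e-10, 1e-3, 1e-1], "max_depth": [3, 6, 9]},
    cv=5, scoring="neg_mean_absolute_error",
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```

A near-zero cost complexity combined with a depth cap, as in the report's final model, lets the tree split freely up to the depth limit.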
View the Regression Tree and Variable Importance

The regression tree has 39 leaf nodes. Slope, age, and oldpeak are the dominant predictors for thalach, suggesting these factors have a strong relationship with a person’s maximum heart rate during exercise. The presence of heart disease (target) also plays a role, reinforcing the medical link between cardiovascular conditions and heart rate response.

Compare actual (thalach) vs predicted (y_hat) tuned tree
Logistic Summary: Predicting Heart Disease

For the final model, I will use logistic regression to explore heart disease presence. From the variable importance plot, we can see that male sex (sex2), number of major vessels (ca2, ca3, ca4), chest pain type (cp2, cp3, cp4), and ST depression (oldpeak) are among the most significant predictors in the model. These variables play a critical role in determining heart disease risk, aligning with medical insights that suggest factors like chest pain, blood vessel count, and exercise-induced abnormalities are strong indicators of cardiovascular health. The ROC curve confirms that our logistic model performs well, with an AUC of 0.91, and adjusting the cutoff to 0.74 provides a balance between sensitivity (87%) and specificity (82%).

Pruned Logistic Regression Equation

From the logistic regression output, we observe that ca5 is a non-significant predictor (p-value 0.73). However, since ca is categorical by nature, I decided not to modify the model; if desired, ca5 could be combined with the base category, which would simplify interpretation without losing valuable information.

term estimate std.error statistic p.value
(Intercept) -2.83 1.15 -2.46 0.01
sex2 2.12 0.26 8.20 0.00
cp2 -1.28 0.29 -4.41 0.00
cp3 -1.97 0.25 -7.78 0.00
cp4 -2.17 0.36 -6.07 0.00
trestbps 0.02 0.01 3.74 0.00
chol 0.01 0.00 3.20 0.00
thalach -0.03 0.01 -4.84 0.00
exang2 0.94 0.23 4.07 0.00
oldpeak 0.64 0.11 5.84 0.00
ca2 1.93 0.25 7.80 0.00
ca3 2.57 0.35 7.41 0.00
ca4 2.06 0.46 4.46 0.00
ca5 -0.28 0.82 -0.34 0.73
Metrics and VI Plot
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Logistic Model 0.88 0.92 0.84 0.88 0.91
Pruned Logistic Model 0.81 0.66 0.96 0.81 0.75
          Truth
Prediction Yes  No
       Yes 504 172
       No   22 327
View the ROC Curve
Best Threshold
Best_Cutoff Sensitivity Specificity AUC_for_Model
0.74 0.87 0.82 0.91
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Logistic Model 0.88 0.92 0.84 0.88 0.91
Pruned Logistic Model 0.81 0.66 0.96 0.81 0.75
Logistic Model Cutoff 0.74 0.84 0.82 0.87 0.84 0.83
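A best cutoff like the 0.74 above is typically found by scanning candidate thresholds and maximizing sensitivity + specificity (Youden's index). A numpy sketch on a tiny made-up example, not the report's actual code:

```python
import numpy as np

def best_cutoff(probs, truth, grid=np.linspace(0.01, 0.99, 99)):
    """Return the threshold that maximizes sensitivity + specificity."""
    best, best_score = None, -1.0
    for c in grid:
        pred = probs >= c
        sens = np.mean(pred[truth == 1])    # true positive rate
        spec = np.mean(~pred[truth == 0])   # true negative rate
        if sens + spec > best_score:
            best, best_score = c, sens + spec
    return best

# Tiny illustrative example with clearly separated classes
probs = np.array([0.125, 0.225, 0.345, 0.655, 0.775, 0.875])
truth = np.array([0, 0, 0, 1, 1, 1])
print(best_cutoff(probs, truth))  # ~0.35: the first threshold that separates perfectly
```

On real, overlapping probability distributions the optimum trades sensitivity against specificity, which is how the 0.74 cutoff yields the 87%/82% balance reported.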
Classification Models

When predicting the presence of heart disease (target = 1), I coded the outcome so that "Yes" indicates a diagnosis of heart disease, while "No" indicates no heart disease. For this analysis, I fit both classification trees and logistic regression models to compare their predictive performance. Both classification tree models (default and cutoff-adjusted) achieved a sensitivity of 91%, meaning they correctly identified 91% of individuals with heart disease. Additionally, they demonstrated high specificity (95%) and an overall accuracy of 93%, making them a strong predictive tool. When adjusting the classification threshold, we observed that modifying the cutoff to 0.59 had minimal impact on classification results. This occurred because most predicted probabilities were below 0.59, leading to similar classifications and unchanged performance metrics. The precision is 95% and the AUC is 0.98. Overall, both models show strong predictive performance in detecting heart disease.

Classification Tree Summary

I will use all the variables. For this model the cost complexity is set to 0.001.

model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Logistic Model 0.88 0.92 0.84 0.88 0.91
Pruned Logistic Model 0.81 0.66 0.96 0.81 0.75
Logistic Model Cutoff 0.74 0.84 0.82 0.87 0.84 0.83
Classification Tree Model 0.93 0.91 0.95 0.93 0.95
          Truth
Prediction Yes  No
       Yes 481  23
       No   45 476
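The headline metrics can be recomputed directly from the confusion matrix above; a plain-Python sketch of the standard definitions (the report's own metrics came from R):

```python
# Cell counts from the classification tree's confusion matrix above
tp, fp = 481, 23   # predicted Yes: truth Yes / truth No
fn, tn = 45, 476   # predicted No:  truth Yes / truth No

accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)   # recall on true disease cases
specificity = tn / (tn + fp)   # recall on true non-disease cases
precision = tp / (tp + fp)     # share of "Yes" predictions that are correct

print(round(accuracy, 2), round(sensitivity, 2),
      round(specificity, 2), round(precision, 2))  # 0.93 0.91 0.95 0.95
```

These match the Classification Tree Model row in the metrics table.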
View the Classification Tree and Variable Importance

The classification tree has 32 leaf nodes, each representing a final decision point in the model. The more splits the tree has, the more complex its decision-making process becomes. This tree structure helps classify individuals based on various health indicators, ultimately predicting whether a person has heart disease or not. Looking at the Variable Importance Plot (VIP), the higher the importance value, the greater the influence of the variable on the model’s decision-making process. In this case, chest pain type (cp) is the most influential predictor, followed by thalassemia status (thal) and maximum heart rate achieved (thalach). These variables significantly impact the likelihood of heart disease, as they are key indicators used in medical assessments.

View the ROC Curve
Best Threshold

The summary of descriptive statistics shows that while the model makes varied predictions, most probabilities are clustered around 0.5, with only the top 25% exceeding 0.97. Since half of the predictions are below 0.46, raising the classification cutoff to 0.59 has minimal impact, as many cases remain classified as “No.” While some cases are confidently predicted as “Yes,” they are relatively rare, leading to unchanged classification results and performance metrics.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.01376 0.46154 0.51317 0.97083 1.00000 
Best_Cutoff Sensitivity Specificity AUC_for_Model
0.59 0.91 0.95 0.98
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Logistic Model 0.88 0.92 0.84 0.88 0.91
Pruned Logistic Model 0.81 0.66 0.96 0.81 0.75
Logistic Model Cutoff 0.74 0.84 0.82 0.87 0.84 0.83
Classification Tree Model 0.93 0.91 0.95 0.93 0.95
Classification Tree Model Cutoff 0.59 0.93 0.91 0.95 0.93 0.95
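The threshold insensitivity described above (moving the cutoff from 0.5 to 0.59 changes nothing because almost no probabilities fall in between) can be illustrated with a synthetic probability vector shaped loosely like the quartile summary reported earlier:

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for the tree's predicted probabilities: most mass near 0 or near 1,
# plus a cluster just below 0.5 (roughly mirroring the reported quartiles)
probs = np.concatenate([rng.uniform(0.00, 0.05, 480),
                        rng.uniform(0.95, 1.00, 460),
                        rng.uniform(0.40, 0.50, 85)])

labels_default = probs >= 0.50
labels_adjusted = probs >= 0.59

# No probability falls between 0.50 and 0.59, so every label is unchanged
print(np.array_equal(labels_default, labels_adjusted))  # True
```

With probabilities this polarized, any cutoff in a wide band around 0.5 produces identical classifications, which is why the 0.59 cutoff row duplicates the default tree's metrics.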
Summary and Reflection

From the first analysis, I can confidently recommend the tuned regression tree model for predicting thalach (maximum heart rate achieved). The tuned regression tree model achieves the lowest RMSE (14.56) and MAE (11.28) while achieving the highest R² (0.56), indicating that it explains 56% of the variance in the response, significantly more than the linear models (39%). The predicted vs. actual plot further supports this conclusion, where the tuned regression tree model (purple) aligns more closely with the ideal diagonal line, suggesting better prediction accuracy. A key takeaway from the Variable Importance Plot (VIP) is that ST segment slope (slope), age, and ST depression (oldpeak) are among the most influential predictors for modeling thalach. These insights suggest that understanding these variables can enhance decision-making related to cardiovascular health.

From the second analysis, we compared the classification tree model and logistic regression model to predict the presence of heart disease. The classification tree (with the 0.59 cutoff) outperforms logistic regression overall, with 93% accuracy, 91% sensitivity, 95% specificity, and 95% precision, ensuring more accurate identification of individuals with and without heart disease. The ROC Curve confirms this distinction, with the classification tree having a higher AUC (0.98) compared to logistic regression. For the classification model, the Variable Importance Plot (VIP) highlights the most significant predictors of heart disease. The top predictors are chest pain type (cp), thalassemia status (thal), and maximum heart rate (thalach). These factors strongly influence the classification of heart disease cases and align with established medical risk factors.

Reflection: One of the aspects I am most proud of in this project is my ability to work with a real-life dataset and analyze it using R. Given that this was my first experience coding and using a programming language, I feel a great sense of accomplishment in understanding data manipulation, modeling, and interpretation in R. Throughout the project, I built confidence in my ability to apply statistical techniques, and I am excited to use these skills in future projects.

If I had another week to work on the project, I would focus on enhancing the predictive power of my models. One approach in the logistic regression would be combining predictor levels, such as merging ca5 into the base category, to explore whether it improves classification performance. Additionally, I would experiment with other methods, such as Random Forest or Boosting, to compare their accuracy and robustness against the classification tree and logistic regression models. These methods could potentially improve generalizability and further optimize sensitivity and specificity in predicting heart disease.

Predicting Continuous Thalach Value

In addition, comparing the models for predicting thalach, we see that the tuned regression tree further enhances performance, reducing MAE to 11.28, with an R² of 56% and an RMSE of 14.56.

model RMSE MAE RSQ
Linear Model Full 17.894 13.951 0.394
Linear Pruned Final Model 17.980 14.036 0.389
Lasso Regression Penalty 0.1 17.897 13.948 0.394
Lasso Regression Tuned 17.518 13.732 0.353
Regression Tree Model 15.504 12.140 0.545
Tuned Regression Tree Model 14.557 11.279 0.557
Compare actual (Thalach) vs predicted (y_hat) tuned tree
Predicting Categorical Target Value

In predicting heart disease, the classification tree has higher precision (95%) for predicting heart disease cases (Yes), ensuring fewer false positives, and high sensitivity (91%), correctly identifying most individuals with heart disease. The baseline logistic model's sensitivity (92%) is marginally higher, but it trails the tree on accuracy, specificity, and precision. Given these results, the classification tree model is the better choice.

model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Logistic Model 0.88 0.92 0.84 0.88 0.91
Pruned Logistic Model 0.81 0.66 0.96 0.81 0.75
Logistic Model Cutoff 0.74 0.84 0.82 0.87 0.84 0.83
Classification Tree Model 0.93 0.91 0.95 0.93 0.95
Classification Tree Model Cutoff 0.59 0.93 0.91 0.95 0.93 0.95
ROC Curves